1. INTRODUCTION

Ensemble learning is a branch of machine learning that can improve performance in predicting or
classifying data patterns [1]. Technically, the method combines several machine learning algorithms to avoid
the risk of underperformance by reducing bias and variance [2]. Each algorithm has its own characteristics,
with its own advantages and disadvantages. These differences are statistical, computational, and
representational in nature, and they motivate the ensemble learning approach [3].

Ensemble learning offers several methods for improving accuracy and stability. They differ in approach:
1) bagging [4], which generates several versions of a predictor and aggregates them into a single
prediction; 2) boosting [5], which runs a weak learning algorithm on different distributions over the
training data before merging the classifiers into a single composite classifier; and 3) voting [6], the
simplest of the three, which combines all algorithms using rules such as minimum probability, maximum
probability, majority voting, product of probabilities, and average of probabilities. This study uses these
three methods to reduce the bias in performance between the decision tree, Naive Bayes, and support vector
machine (SVM) algorithms.

Koutanaei et al. [7] studied ensemble learning methods and feature selection algorithms for credit
scoring. The study analyzed artificial neural network (ANN), classification and regression tree (CART),
Naive Bayes, and SVM classifiers. In addition to bagging and AdaBoost, it included two more ensemble
methods, stacking and random forest. ANN-AdaBoost was the best classifier for credit scoring, while
Naive Bayes and SVM-AdaBoost were the worst classifiers.

Ankit and Saleena [8] studied a classification system for Twitter sentiment analysis. They proposed a
weighted ensemble classifier that combines base learners into a single classifier and compared it with the
individual classifiers and a majority voting ensemble classifier. The base algorithms were Naive Bayes,
random forest, SVM, and logistic regression. The proposed classifier outperformed the other methods.

Onan et al. [6] studied latent Dirichlet allocation (LDA) based topic modelling in text sentiment
classification. The empirical analysis used LDA with Naive Bayes, SVM, logistic regression, radial basis
function network, and K-nearest neighbor algorithms. The ensemble learning methods were bagging, AdaBoost,
random subspace, voting, and stacking. The stacking ensemble achieved the highest performance score.

Anshary and Trilaksono [9] studied target market classification using ensemble methods, with CART as
the base classifier for bagging and boosting. A total of 3000 data items from a specific account with
200,000 followers were divided 4:1 into training and testing data. Bagging increased precision by 1.9% and
provided the highest performance score among the three.

Fouad et al. [10] studied sentiment analysis using feature selection and a classifier ensemble. The
study used a voting ensemble to take the majority decision among SVM, Naive Bayes, and logistic regression.
The voting ensemble outperformed the other classifiers on two datasets, while SVM and logistic regression
each outperformed the others on one dataset.

This study aims to increase classifier performance and to reduce bias in classifying sentiment. Three
classifiers (decision tree, Naive Bayes, and support vector machine) are compared under four conditions:
1) the performance score of each classifier using cross validation; 2) the bagging version of each
classifier; 3) the AdaBoost version of each classifier; and 4) an ensemble of the three classifiers using
the voting method. Using 1670 data items collected from Twitter, these four conditions are compared to
determine which gives the best overall performance score.

This paper is divided into four sections. Section 1 states the background, purpose of the study, and
related research. Section 2 describes the research methodology. Section 3 presents the results and
discussion. The last section states the conclusion and future work.


2. RESEARCH METHOD 

The research consists of three phases: 1) data preprocessing, which includes retrieving data from
Twitter and pre-processing them; 2) training and testing, which applies three classifiers (decision tree,
support vector machine, and Naive Bayes) to individual classification and ensemble classification (bagging,
AdaBoost, and voting); and 3) parameter measurement, which uses five measures: accuracy, precision, recall,
F-measure, and the ROC curve. The phases are shown in Figure 1.


2.1. Data collecting and pre-processing 

Of the 3000 data items collected from Twitter, only 1670 passed manual classification, as only those
items passed the feasibility labelling. The 1670 items consist of tweets matching queries about Ancol’s
tourist attractions, such as ‘ancol’, ‘dufan’, ‘seaworld’, and ‘ocean dream samudra’. After the data were
collected in comma separated values (CSV) format, the pre-processing phase was conducted. The sentiment
values are positive and negative.

The pre-processing phase consists of 1) transforming all letters into lowercase (case folding);
2) separating words into tokens and removing all punctuation and whitespace (tokenizing); 3) removing
unnecessary words using a stopwords dictionary (stopwords removal); and 4) transforming all tokens into
their base words (stemming). The stopwords dictionary is the one from Tala [11], which consists of
Indonesian stopwords and has been widely used as a reference in other pre-processing work. For stemming,
the Sastrawi dictionary [12] was used in the form of regular expressions (regex).
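
The four pre-processing steps can be sketched in Python as follows. This is only an illustration under
stated assumptions: the stopword set and the suffix pattern are hypothetical stand-ins for the Tala
dictionary [11] and the Sastrawi rules [12], which are much larger, and the function name `preprocess` is
not taken from the study.

```python
import re

# Hypothetical mini stopword list; the study uses the Tala dictionary [11].
STOPWORDS = {"yang", "di", "ke", "dan", "ini", "itu"}

# Toy suffix rule standing in for the Sastrawi stemmer [12]; not the real rule set.
SUFFIX_PATTERN = re.compile(r"(nya|lah|kah)$")


def preprocess(tweet):
    """Return the list of cleaned tokens for one tweet."""
    text = tweet.lower()                                  # 1) case folding
    tokens = re.findall(r"[a-z0-9]+", text)               # 2) tokenizing (drops punctuation, whitespace)
    tokens = [t for t in tokens if t not in STOPWORDS]    # 3) stopwords removal
    return [SUFFIX_PATTERN.sub("", t) for t in tokens]    # 4) stemming (toy rule)


print(preprocess("Liburan di Dufan, serunya!"))  # ['liburan', 'dufan', 'seru']
```

A real pipeline would replace the toy stopword set and suffix rule with the full dictionaries cited above.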


TELKOMNIKA Telecommun Comput El Control, Vol. 19, No. 5, October 2021: 1747 - 1754 




[Flowchart: Data Collecting and Preprocessing; Training and Testing (training datasets; Decision Tree,
Support Vector Machine, Naive Bayes; Ensemble Method: Bagging, AdaBoost; testing datasets; Final Result);
Parameter Measurement (1. Accuracy, 2. Precision, 3. Recall, 4. F-measure, 5. ROC Curve)]


Figure 1. Research methodology 


2.2. Testing and training 

This research applies cross validation to divide the dataset into testing and training data. Cross
validation divides the samples into k subsets of the same size. In each iteration, one subset is used as
the testing dataset and the remaining k-1 subsets are used for training. With k in the range 2-10, the
number of training subsets therefore ranges from 1 (k = 2) to 9 (k = 10).
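
The fold construction can be illustrated with a short Python sketch. It assumes unshuffled, contiguous
folds whose sizes differ by at most one when the sample count is not divisible by k; the study does not
state how its folds were drawn, and the name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                   # one subset for testing
        train = indices[:start] + indices[start + size:]     # remaining k-1 subsets for training
        yield train, test
        start += size


# With 1670 samples and k = 9, each fold tests on 185 or 186 samples.
for train, test in k_fold_splits(1670, 9):
    print(len(train), len(test))
```

Each sample appears in exactly one testing subset, so every item is used for testing once and for training
k-1 times.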


2.3. Ensemble method 

Ensemble methods are used to reduce bias and increase the performance score of a classifier, in other
words, to solve problems that a single classifier faces. Ensemble methods combine the outputs of base
classifiers to boost the performance score [13]. In the training phase, three standalone classifiers
(decision tree, support vector machine, and Naive Bayes) serve as the base learning algorithms for ensemble
learning. In the bagging method, the datasets are sampled randomly [14] and the resulting classifiers are
combined using majority voting as the final classification in the testing phase. AdaBoost creates base
classifiers sequentially by weighting the training data through iterations [7], and the weights are adapted
according to each base classifier’s misclassifications. Lastly, the voting method uses the majority
decision collected from the three base classifiers. As one of the crucial problems of machine learning lies
in minimizing its error function, an ensemble method can set ‘an average’ that reduces the risk of choosing
the wrong classifier [9]. Thus, ensemble learning produces base classifiers and then combines them to get a
better result.


2.3.1. Decision tree 

A decision tree is a classifier in which each internal node represents a condition on a feature of the
model, each branch is an outcome of that condition, and each leaf reflects the class predicted by the
algorithm [15]. It is a classification algorithm that determines how decisions are made based on samples or
certain criteria and classes. In other words, the decision tree eliminates unnecessary calculations in
classification and is flexible because it can select different features at different internal nodes. A
decision node takes an action to select one of the edges stemming from it. At a chance node, the edge is
selected randomly, while the terminal nodes represent the end of the actions.
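
The node, branch, and leaf structure described above can be sketched as a nested dictionary in Python. The
tree below is hand-built purely for illustration; the feature names `contains_bagus` and `contains_mahal`
are hypothetical and are not features reported by the study.

```python
# Hypothetical hand-built tree: internal nodes test a feature, leaves hold a class.
TREE = {
    "feature": "contains_bagus",
    "branches": {
        True: {"label": "positive"},
        False: {
            "feature": "contains_mahal",
            "branches": {True: {"label": "negative"}, False: {"label": "positive"}},
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches each test outcome."""
    while "label" not in node:
        outcome = sample[node["feature"]]   # evaluate the condition at this internal node
        node = node["branches"][outcome]    # follow the matching branch
    return node["label"]                    # a leaf gives the predicted class


print(predict(TREE, {"contains_bagus": False, "contains_mahal": True}))  # negative
```

Only the features on the path from the root to the reached leaf are evaluated, which is the sense in which
unnecessary calculations are avoided.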


2.3.2. SVM 
Support vector machine (SVM) is a classifier defined by a separating hyperplane, which assigns labeled
training data to output categories [16]. The SVM algorithm is stated as follows [17]:
- A classifier for binary classification uses y (labels) and x (features) to denote the class labels and
features, with parameters w (normal to the line) and b (bias), as stated in formula (1).


$$f(x) = w^{T} x + b \qquad (1)$$


- The SVM is then represented by a separating hyperplane f(x) that geometrically bisects the data space
into two distinct regions, thus classifying the input data space into two categories.

- The function f(x) denotes the hyperplane that classifies the data set, so the two regions created by the
hyperplane correspond to the two categories of data under the two class labels.

- Let the class labels assigned to the data vectors for supervised classification be denoted by $y_i$,
which is +1 for one category of data vectors and -1 for the other category of data vectors.
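
As a minimal sketch of formula (1), the following Python code evaluates f(x) for a given weight vector and
bias and maps the result to a class label. The values of `w` and `b` are made up for illustration (in
practice they are learned from the training data), and assigning points that lie exactly on the hyperplane
to +1 is an assumed convention.

```python
def decision_function(w, x, b):
    """Evaluate f(x) = w^T x + b for plain Python lists w and x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if decision_function(w, x, b) >= 0 else -1


w, b = [2.0, -1.0], -0.5            # hypothetical parameters, not learned here
print(classify(w, [1.0, 0.5], b))   # f(x) = 1.0  -> 1
print(classify(w, [0.0, 1.0], b))   # f(x) = -1.5 -> -1
```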


2.3.3. Naive Bayes 
Naive Bayes is a probabilistic model that captures uncertainty by determining the probabilities of the
outcomes [18]. Naive Bayes generates probabilities using the formula below:


$$\text{probability} = \frac{\text{likelihood} \times \text{prior probability}}{\text{evidence}}$$
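
A small worked example may make the formula concrete. The sketch below assumes the usual "naive"
conditional-independence assumption, so the likelihood of a tweet is the product of per-word
probabilities; all probabilities and the function names are invented for illustration and are not values
from the study.

```python
import math


def naive_likelihood(word_probs, tokens):
    """Likelihood of the tokens under one class, assuming the words are independent."""
    return math.prod(word_probs[t] for t in tokens)


def posterior(priors, likelihoods):
    """Bayes' rule: probability = likelihood * prior probability / evidence."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())          # total probability of the observed tokens
    return {c: joint[c] / evidence for c in joint}


# Hypothetical per-class word probabilities and priors.
word_probs = {
    "positive": {"seru": 0.5, "mahal": 0.1},
    "negative": {"seru": 0.1, "mahal": 0.5},
}
priors = {"positive": 0.5, "negative": 0.5}
tokens = ["seru", "seru"]

likelihoods = {c: naive_likelihood(word_probs[c], tokens) for c in priors}
result = posterior(priors, likelihoods)
print(max(result, key=result.get), result)
```

The predicted class is the one with the largest probability; the evidence term only normalizes the values
so that they sum to one.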


2.3.4. Bagging 

Bagging is a parallel method: it generates base learners in parallel and is concerned with reducing
variance and obtaining good generalization ability [19]. Bagging generates several versions of a predictor,
each trained on a random sample of the dataset [14], and combines them by majority voting to obtain an
aggregated prediction as the final classification in the testing phase [20]. Bagging is well suited to
classification problems [21]. The pseudo-code of bagging [22] is shown in Figure 2.


Input: training sample S, classifier L, iterations T
Training:
For t = 1 to T
    $S_t$ = bootstrap sample from S
    $L_t$ = train a classifier on $S_t$ via L
End for
$L_B(x) = \arg\max_{y} \sum_{t:\, L_t(x) = y} 1$


Output: result $L_B$


Figure 2. Pseudo-code of bagging 
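
The pseudo-code in Figure 2 can be sketched in Python as follows. This is an illustration only: the base
learner is a hypothetical one-dimensional threshold stump rather than the decision tree, SVM, or Naive
Bayes classifiers used in the study, and the toy data set and the seed are assumptions.

```python
import random
from collections import Counter


def train_stump(sample):
    """Base learner L: pick the 1-D threshold and orientation with the fewest training errors."""
    best = None
    for threshold, _ in sample:
        for sign in (1, -1):
            errors = sum(1 for x, y in sample if (sign if x >= threshold else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign


def bagging_fit(S, L, T, seed=0):
    """Train T classifiers, each on a bootstrap sample of S."""
    rng = random.Random(seed)
    models = []
    for _ in range(T):
        S_t = [rng.choice(S) for _ in S]    # bootstrap: draw |S| items with replacement
        models.append(L(S_t))
    return models


def bagging_predict(models, x):
    """Majority vote: the label predicted by the largest number of classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


S = [(0.1, -1), (0.4, -1), (0.5, -1), (0.9, 1), (1.2, 1), (1.5, 1)]
models = bagging_fit(S, train_stump, T=11)
print(bagging_predict(models, 0.2), bagging_predict(models, 1.4))
```

An odd number of rounds is used so that a two-class vote cannot end in a tie.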


2.3.5. Boosting

AdaBoost, the most popular boosting algorithm, was introduced by Freund and Schapire [5] as a
sequential learning method. It creates base classifiers by training on weighted transactions through
iterations. Similar to bagging, the main goal of boosting [23] is to classify by averaging the numeric
estimates output by the base classifier models. In boosting, each new model is influenced by the models
built before it, and new models are encouraged to become experts. The pseudo-code of boosting [22] is
shown in Figure 3.


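Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
       Base learning algorithm L;
       Number of learning rounds T.

Process:
    $D_1(i) = 1/m$.    % Initialize the weight distribution
    For t = 1, ..., T:
        $h_t = L(D, D_t)$;    % Train a base learner $h_t$ from D using distribution $D_t$
        $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$;    % Measure the error of $h_t$
        $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$;
        $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \exp(-\alpha_t) & \text{if } h_t(x_i) = y_i \\ \exp(\alpha_t) & \text{if } h_t(x_i) \neq y_i \end{cases}$
        $\phantom{D_{t+1}(i)} = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
            % Update the distribution, where $Z_t$ is a normalization
            % factor which enables $D_{t+1}$ to be a distribution
    End.

Output: $H(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$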


Figure 3. Pseudo-code boosting 
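
The steps in Figure 3 can be sketched in Python as follows. The sketch assumes labels in {-1, +1} and uses
a hypothetical weighted one-dimensional stump as the base learner L; the early stop when the error reaches
0.5 and the small floor on the error are implementation safeguards added here, not steps from the
pseudo-code, and the toy data set is invented.

```python
import math


def train_weighted_stump(data, weights):
    """Base learner L(D, D_t): 1-D threshold stump minimising the weighted error."""
    best = None
    for threshold, _ in data:
        for sign in (1, -1):
            error = sum(w for (x, y), w in zip(data, weights)
                        if (sign if x >= threshold else -sign) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    error, threshold, sign = best
    return (lambda x: sign if x >= threshold else -sign), error


def adaboost_fit(data, T):
    m = len(data)
    weights = [1.0 / m] * m                     # D_1(i) = 1/m
    ensemble = []                               # list of (alpha_t, h_t)
    for _ in range(T):
        h, eps = train_weighted_stump(data, weights)
        if eps >= 0.5:                          # safeguard: learner no better than chance
            break
        eps = max(eps, 1e-10)                   # safeguard: avoid division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Raise the weight of misclassified items, lower the weight of correct ones.
        weights = [w * math.exp(-alpha * y * h(x)) for (x, y), w in zip(data, weights)]
        z = sum(weights)                        # normalisation factor Z_t
        weights = [w / z for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x)); a score of exactly 0 is mapped to +1."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


data = [(0, 1), (1, 1), (2, -1), (3, -1), (4, 1), (5, 1)]   # not separable by a single stump
ensemble = adaboost_fit(data, T=5)
print([adaboost_predict(ensemble, x) for x, _ in data])
```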


2.3.6. Voting 

The voting method generates its prediction by forming the overall ensemble prediction from the base
classifiers [3]. Voting combines all base classifiers using rules that include minimum and maximum
probability, majority voting, product of probabilities, and average of probabilities [6]. Pseudo-code for
voting [24] is shown in Figure 4.


Input:
    $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, $x_i \in X$, $y_i \in Y$
    $Y = \{1, \ldots, c\}$; $F = \{f_1, \ldots, f_m\}$

Output: $H(x)$

1) Initialize the class weights:
    $W = \begin{bmatrix} w_{11} & \cdots & w_{1c} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mc} \end{bmatrix}$,
    $w_{ij} = 1$; $i = [1 \ldots m]$; $j = [1 \ldots c]$
2) For i from 1 to m:
    (a) Fit a classifier $f_i(x)$ to the training data
    (b) Calculate the possibility $w_{ij}$ for $f_i(x) = j$
3) Use all classifiers to predict: classifier.predict(Test_x)
4) Calculate the possibility for one record to belong to class c:
    $p_c = \sum_{i=1}^{m} w_{ic}\, I(f_i(x) = c)$
5) Output:
    $H(x) = \arg\max_{c} \sum_{i=1}^{m} w_{ic}\, I(f_i(x) = c)$



Figure 4. Pseudo-code of voting 
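
The combination rules named above can be illustrated with a short Python sketch. It shows unweighted
majority voting on predicted labels and the four probability-based rules on per-class probabilities; the
probability values are invented, and the helper names are hypothetical rather than taken from [24].

```python
import math
from collections import Counter

# Each rule reduces the list of per-classifier probabilities for one class to a single score.
RULES = {
    "average": lambda ps: sum(ps) / len(ps),
    "product": math.prod,
    "minimum": min,
    "maximum": max,
}


def majority_vote(labels):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(labels).most_common(1)[0][0]


def combine_probabilities(prob_rows, rule):
    """Score each class with the chosen rule and return the best-scoring class."""
    combine = RULES[rule]
    scores = {c: combine([row[c] for row in prob_rows]) for c in prob_rows[0]}
    return max(scores, key=scores.get)


# Hypothetical outputs of the three base classifiers for one tweet.
prob_rows = [
    {"positive": 0.55, "negative": 0.45},   # decision tree
    {"positive": 0.10, "negative": 0.90},   # Naive Bayes
    {"positive": 0.60, "negative": 0.40},   # SVM
]
labels = [max(row, key=row.get) for row in prob_rows]

print(majority_vote(labels))                        # positive (two of three votes)
print(combine_probabilities(prob_rows, "average"))  # negative (Naive Bayes is very confident)
```

The example is chosen so that majority voting and the probability-based rules disagree, which shows why
the choice of rule can matter.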


2.4. Parameter measurement 

With reference to the confusion matrix, there are four components: 1) true positive (TP), a positive
tuple classified as positive; 2) true negative (TN), a negative tuple classified as negative; 3) false
positive (FP), a negative tuple classified as positive; and 4) false negative (FN), a positive tuple
classified as negative. The classification algorithm with the best performance can be identified from the
minimal difference between the precision and recall measurements [25]. If a classifier has a high value on
both measurements, the classification algorithm is not biased (balanced between the true positive rate and
the false positive rate). To examine this, the receiver operating characteristic (ROC) curve is used, which
gives a visual comparison between the true positive rate (TPR) and the false positive rate (FPR). The ROC
has five accuracy levels:

- 0.90-1.00 = Excellent classification

- 0.80-0.90 = Good classification

- 0.70-0.80 = Fair classification

- 0.60-0.70 = Poor classification

- 0.50-0.60 = Failure
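
The measurements can be computed from the four confusion matrix components as in the Python sketch below.
The counts are invented for illustration, the formulas are the standard definitions, and two conventions
are assumptions: a value on a band boundary (for example 0.90) is assigned to the higher band, and values
below 0.50 are also reported as failure.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard accuracy, precision, recall, and F-measure (assumes non-zero denominators)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def auc_level(auc):
    """Map an AUC value to the five accuracy levels listed above."""
    if auc >= 0.90:
        return "Excellent classification"
    if auc >= 0.80:
        return "Good classification"
    if auc >= 0.70:
        return "Fair classification"
    if auc >= 0.60:
        return "Poor classification"
    return "Failure"   # assumption: values below 0.50 are treated as failure as well


metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)   # hypothetical counts
print(metrics)
print(auc_level(0.894))  # Good classification
```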


3. RESULTS AND ANALYSIS 

Each classifier reaches its best accuracy score at a different number of folds. Of the 2-10 range, this
study reports k = 2, 6, and 9, as these three hold each classifier’s highest accuracy score, as detailed in
Table 1. Based on Table 1, the highest accuracy score, 88.14%, is achieved by the decision tree both as a
standalone classifier and with the AdaBoost method, at 9 folds. The measurements for precision, recall,
F-measure, and area under the curve (AUC) are shown in Figures 5-8.


Table 1. Comparison of accuracy measurement 


Accuracy
K-folds  Standalone               Bagging                  AdaBoost                 Voting
         SVM     NB      DT       SVM     NB      DT       SVM     NB      DT
2        85.69%  68.92%  87.90%   85.75%  69.52%  87.66%   85.69%  68.92%  87.90%   87.25%
6        86.77%  65.21%  87.90%   87.07%  65.81%  87.43%   86.83%  65.21%  87.90%   88.02%
9        86.77%  64.79%  88.14%   86.53%  64.85%  88.08%   86.83%  64.79%  88.14%   88.02%


[Bar chart: precision (%) for each standalone and ensemble classifier]

Figure 5. Precision measurement 


[Bar chart: recall (%) for each standalone and ensemble classifier]


Figure 6. Recall measurement 


As summarized in Table 1, the decision tree as a standalone classifier matches the decision tree with
the AdaBoost method. Thus, for accuracy there is no specific improvement from standalone classifiers to
ensemble classifiers. In Figure 5, the highest precision score is likewise shared by the decision tree and
the decision tree with the AdaBoost method, so the same conclusion applies to this measurement. In
Figure 6, the highest recall score belongs to the support vector machine with the bagging method. Bagging
gave a sizeable improvement, from 87.02% (standalone support vector machine) to 97.07%. For F-measure,
Figure 7 shows that the highest score also belongs to the support vector machine with the bagging method,
with a sizeable improvement from 92.61% to 98.30%. Finally, Figure 8 shows that the support vector machine
with the bagging method also has the highest AUC score, 0.894, which is classified as good classification.





[Bar chart: F-measure (%) for each standalone and ensemble classifier]


Figure 7. F-measure measurement 


Figure 8. AUC measurement


4. CONCLUSION 

Ensemble methods have little effect on accuracy and precision, but a considerable effect on recall,
F-measure, and AUC, since the biggest improvements in this study appear in those three measurements. For
the overall performance score, the support vector machine with the bagging method outperforms the other
classifiers in terms of recall, F-measure, and AUC. Meanwhile, the decision tree (both standalone and with
the AdaBoost method) outperforms the other classifiers in terms of accuracy and precision. The voting
method does not stand out in comparison with the other ensemble classifiers. For future work, this study
can be improved by adding more data and other ensemble methods. Also, since this study only included
‘positive’ and ‘negative’ sentiment, a ‘neutral’ sentiment could be added to see whether it affects the
performance measurements.